(57) ABSTRACT

A system and method for motion vector extraction and 
computation is embodied in an architecture adapted to 
overlap a data extraction process with a computation process 
and to provide 2-frame store decode with letterbox scaling 
capability, to extract a plurality of parameters usable for 
calculating a motion vector, and to compute motion vectors. 
The architecture is adapted to compute vertical and hori- 
zontal components of motion vectors in back-to-back cycles. 
The architecture includes a motion vector compute pipeline 
which, in a preferred embodiment, includes a delta compute
engine, a raw vector compute engine, a motion vector with 
respect to top left corner of picture block, or a combination 
of these logic circuits. The delta compute engine is adapted 
to generate a delta from a motion code and a motion residual 
and to compute a predicted motion vector in consideration of 
a motion vector of a previous macroblock. The raw vector 
compute engine is adapted to generate a raw vector from a 
delta and a predicted motion vector. The motion vector block 
is adapted to generate luma and chroma motion vectors from 
a raw vector and macroblock coordinate information. 

SYSTEM AND METHOD FOR MOTION VECTOR EXTRACTION AND
COMPUTATION MEETING 2-FRAME STORE AND LETTERBOXING
REQUIREMENTS

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of U.S. patent
application No. 08/904,084 filed Jul. 31, 1997, entitled
ARCHITECTURE FOR DECODING MPEG COMPLIANT VIDEO BITSTREAMS
MEETING 2-FRAME AND LETTERBOXING REQUIREMENTS, by Surya P.
Varanasi and Satish Soman.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to the field of multimedia
systems, and more particularly to a video decoding device
having the ability to meet particular predetermined
transmission and display constraints. The video decoding device
is particularly suited for Motion Picture Expert Group
(MPEG) data compression and decompression standards.

2. Description of the Related Art

Multimedia software applications including motion pictures
and other video modules employ MPEG standards in
order to compress, transmit, receive, and decompress video
data without appreciable loss. Several versions of MPEG
currently exist or are being developed, with the current
standard being MPEG-2. MPEG-2 video is a method for
compressed representation of video sequences using a common
coding syntax. MPEG-2 replaces MPEG-1 and
enhances several aspects of MPEG-1. The MPEG-2 standard
includes extensions to cover a wider range of
applications, and includes the addition of syntax for more
efficient coding of interlaced video and the occurrence of
scalable extensions which permit dividing a continuous
video signal into multiple coded bitstreams representing
video at different resolutions, picture quality, or frame rates.
The primary target application of MPEG-2 is the all-digital
broadcast of TV quality video signals at coded bitrates
between 4 and 9 Mbit/sec. MPEG-1 was optimized for
CD-ROM or applications transmitted in the range of 1.5
Mbit/sec, and video was unitary and non-interlaced.

An encoded/compressed data stream may contain multiple
encoded/compressed video and/or audio data packets or
blocks. MPEG generally encodes or compresses video packets
based on calculated efficient video frame or picture
transmissions.

Three types of video frames are defined. An intra or
I-frame is a frame of video data including information only
about itself. Only one given uncompressed video frame can
be encoded or compressed into a single I-frame of encoded
or compressed video data.

A predictive or P-frame is a frame of video data encoded
or compressed using motion compensated prediction from a
past reference frame. A previous encoded or compressed
frame, such as an I-frame or a P-frame, can be used when
encoding or compressing an uncompressed frame of video
data into a P-frame of encoded or compressed video data. A
reference frame may be either an I-frame or a P-frame.

A bidirectional or B-frame is a frame of video data
encoded or compressed using motion compensated prediction
from a past and future reference frame. Alternately, the
B-frame may use prediction from a past or a future frame of
video data. B-frames are particularly useful when rapid
motion occurs within an image across frames.

Motion compensation refers to the use of motion vectors
from one frame to improve the efficiency for predicting pixel
values of an adjacent frame or frames. Motion compensation
is used for encoding/compression and decoding/decompression.
The prediction method or algorithm uses
motion vectors to provide offset values, error information,
and other data referring to a previous or subsequent video
frame.

The MPEG-2 standard requires encoded/compressed data
to be encapsulated and communicated using data packets.
The data stream is comprised of different layers, such as an
ISO layer and a pack layer. In the ISO layer packages are
transmitted until the system achieves an ISO end code,
where each package has a pack start code and pack data. For
the pack layer, each package may be defined as having a
pack start code, a system clock reference, a system header,
and packets of data. The system clock reference represents
the system reference time.

While the syntax for coding video information into a
single MPEG-2 data stream are rigorously defined within the
MPEG-2 specification, the mechanisms for decoding an
MPEG-2 data stream are not. This decoder design is left to
the designer, with the MPEG-2 spec merely providing the
results which must be achieved by such decoding.

Devices employing MPEG-1 or MPEG-2 standards consist
of combination transmitter/encoders or receiver/decoders,
as well as individual encoders or decoders. The
restrictions and inherent problems associated with decoding
an encoded signal and transmitting the decoded signal to a
viewing device, such as a CRT or HDTV screen indicate that
design and realization of an MPEG-compliant decoding
device is more complex than that of an encoding device.
Generally speaking, once a decoding device is designed
which operates under a particular set of constraints, a
designer can prepare an encoder which encodes signals at
the required constraints, said signals being compliant with
the decoder. This disclosure primarily addresses the design
of an MPEG compliant decoder.

Various devices employing MPEG-2 standards are available
today. Particular aspects of known available decoders
will now be described.

Frame Storage Architecture

Previous systems used either three or two and a half frame
storage for storage in memory.

Frame storage works as follows. In order to enable the
decoding of B-frames, two frames worth of memory must be
available to store the backward and forward anchor frames.
Most systems stored either a three frame or two and a half
frames to enable B-frame prediction. While the availability
of multiple frames was advantageous (more information
yields an enhanced prediction capability), but such a
requirement tends to require a larger storage buffer and takes more
time to perform prediction functions. A reduction in the size
of memory chips enables additional functions to be
incorporated on the board, such as basic or enhanced graphic
elements, or channel decoding capability. These elements
also may require memory access, so incorporating more
memory on a fixed surface space is highly desirable.
Similarly, incorporating functional elements requiring
smaller memory space on a chip is also beneficial.

Scaling

The MPEG-2 standard coincides with the traditional
television screen size used today, thus requiring transmission
having dimensions of 720 pixels (pels) by 480 pixels.
The television displays every other line of pixels in a raster
scan. The typical television screen interlaces lines of pels,
sequentially transmitting every other line of 720 pels (a total
of 240 lines) and then sequentially transmitting the remaining
240 lines of pels. The raster scan transmits the full frame
in 1/30 second, and thus each half-frame is transmitted in 1/60
second.

The MPEG storage method of storing two and a half
frames for prediction relates to this interlacing design. The
two and a half frame store architecture stores two anchor 
frames (either I or P) and one half of a decoded B frame. A 
frame picture is made up of a top and a bottom field, where 
each field represents interlaced rows of pixel data. For 
example, the top field may comprise the first, third, fifth, and 
so forth lines of data, while the bottom field comprises the 
second, fourth, sixth, and so on lines of data. When B frames
are decoded, one half the picture (either the top field or the 
bottom field) is displayed. The other half picture must be 
stored for display at a later time. This additional data 
accounts for the "half frame" in the two and a half frame 
store architecture. 
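
The field structure and storage count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_fields` and `stored_lines` are hypothetical, a frame is modeled as a plain list of scan lines, and storage is counted in lines rather than bytes.

```python
def split_fields(frame_lines):
    """Split a frame (list of scan lines, first line at index 0) into fields."""
    top_field = frame_lines[0::2]     # first, third, fifth, ... lines
    bottom_field = frame_lines[1::2]  # second, fourth, sixth, ... lines
    return top_field, bottom_field


def stored_lines(lines_per_frame=480, anchor_frames=2, keep_b_field=True):
    """Lines of picture storage: anchor frames plus an optional B field."""
    b_field_lines = lines_per_frame // 2 if keep_b_field else 0
    return anchor_frames * lines_per_frame + b_field_lines


if __name__ == "__main__":
    frame = list(range(1, 481))  # scan lines numbered 1..480
    top, bottom = split_fields(frame)
    print(top[:3], bottom[:3], len(top), len(bottom))
    # Two and a half frame store versus two frame store, in lines.
    print(stored_lines(), stored_lines(keep_b_field=False))
```

With `keep_b_field=False` the count corresponds to the two frame store case discussed next, where the second field of a B frame has nowhere to be held.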

In a two frame store architecture, there is no storage for 
the second set of interlaced lines that has been decoded in a 
B-frame. Therefore, an MPEG decoder that supports a two 
frame architecture must support the capability to decode the 
same picture twice in the amount of time it takes to display 
one picture. As there is no place to store decoded B-frame 
data, the output of the MPEG decoder must be displayed in 
real time. Thus the MPEG decoder must have the ability to 
decode fast enough to display a field worth of data. 

A problem arises when the picture to be displayed is in 
what is called the "letterbox" format. The letterbox format is 
longer and narrower than the traditional format, at an 
approximately 16:9 ratio. Other dimensions are used, but 
16:9 is most common. The problem with letterboxing is that 
the image is decreased when displayed on screen, but picture 
quality must remain high. The 16:9 ratio on the 720 by 480
pel screen requires picture on only 3/4 of the screen, while the
remaining 1/4 of the screen is left blank. In order to support a
two-frame architecture with a letterboxing display which
takes 3/4 of the screen, a B-frame must be decoded in 3/4 the
time taken to display a field of data.

The requirements to perform a two frame store rather than 
a two and a half or three frame store coupled with the desire 
to provide letterbox imaging are significant constraints on 
system speed which have not heretofore been achieved by 
MPEG decoders. 

It is therefore an object of the current invention to provide 
an MPEG decoding system which operates at 54 MHz and
sufficiently decodes an MPEG data stream while maintain- 
ing sufficient picture quality. 

It is a further object of the current invention to provide an 
MPEG decoder which supports two frame storage. 

It is another object of the current invention to provide a 
memory storage arrangement that minimizes on-chip space 
requirements and permits additional memory and/or func- 
tions to be located on the chip surface. A common memory 
area used by multiple functional elements is a further 
objective of this invention. 

It is yet another object of the current invention to provide 
an MPEG decoder which supports signals transmitted for 
letterbox format. 

SUMMARY OF THE INVENTION 

In a preferred exemplary embodiment of the present 
invention, a system for motion vector extraction and com- 
putation includes an architecture adapted to overlap a data 
extraction process with a computation process and to pro- 
vide two-frame store decode with letterbox scaling capabil- 
ity. 
In another aspect of the present invention, a system for
motion vector extraction and computation includes an
architecture adapted to overlap a data extraction process with a
computation process and to provide two-frame store decode
with letterbox scaling capability, to extract a plurality of
parameters usable for calculating a motion vector, and to
compute motion vectors. The architecture is adapted to
compute vertical and horizontal components of motion
vectors in back-to-back cycles and includes a motion vector
compute pipeline.

In a preferred embodiment of the present invention, the
aforementioned motion vector compute pipeline includes a
delta compute engine, a raw vector compute engine, a
motion vector with respect to top left corner of the picture
block, or a combination of these logic circuits. The delta
compute engine is adapted to generate a delta from a motion
code and a motion residual and to generate a predicted
motion vector in consideration of a motion vector of a
previous macroblock. The raw vector compute engine is
adapted to generate a raw vector from a delta and a predicted
motion vector. The motion vector with respect to top left
corner of the picture block is adapted to generate luma and
chroma motion vectors from a raw vector and macroblock
coordinate information.

In another aspect of the present invention, a method for
motion vector extraction and computation includes the step
of providing an architecture adapted to overlap a data
extraction process with a computation process, to extract a
plurality of parameters usable for calculating a motion
vector, and to compute vertical and horizontal components
of motion vectors in back-to-back cycles.

Other objects, features, and advantages of the present
invention will become more apparent from a consideration
of the following detailed description and from the
accompanying drawings.

DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates the MPEG video decoder 100 according 
to the current invention; 

FIG. 2 is a detailed illustration of the TMCCORE in
accordance with the current invention;

FIG. 3 presents the timing diagram for the transmission of
data through the TMCCORE;

FIG. 4 shows the staggered timing of data transmission
through the TMCCORE;

FIG. 5A illustrates the data blocks received by the
MBCORE;

FIG. 5B shows the data blocks received by the MBCORE
after 16 bits of data have been transmitted to the system;

FIG. 6 shows the hardware implementation of the Data
Steering Logic;

FIG. 7 is a flowchart illustrating operation of the Data
Steering Logic;

FIG. 8 is a flowchart of the DCT processor multiplication
logic;

FIG. 9 illustrates the implementation of IDCT Stage 1
which functionally calculates X G P;

FIG. 10 is the design for IDCT stage 2, which transposes
the result from IDCT Stage 1 and multiplies the resultant
matrix by P;

FIG. 11 shows the system design for performing the final
functions necessary for IDCT output and storing the values
in appropriate positions in IDCT OUTPUT RAM;

FIG. 12 represents the numbering of pels for use in
motion compensation;

FIG. 13 is the mechanization of the motion compensation
unit used to satisfy two frame store and letterboxing
requirements;

FIG. 14 is a diagram illustrating a current macroblock
being decoded but in the reference picture and a motion
vector r signifying a relative position of the reference
macroblock with respect to the current macroblock;

FIG. 15 is a functional flowchart showing the steps of
extraction of parameters needed for the computation of
motion vectors and the computation process;

FIG. 16 is a diagram illustrating a worst case scenario of
the pipe-lined operation of the motion vector extraction and
computation process;

FIG. 17 is a functional block diagram of the delta compute
engine of the micro-architecture of the motion vector
compute pipeline;

FIG. 18 is a functional block diagram of the range clipping
and raw vector compute engine of the micro-architecture of
the motion vector compute pipeline;

FIG. 19 is a functional block diagram of the motion vector
wrt top-left corner of the picture block of the
micro-architecture of the motion vector compute pipeline; and

FIG. 20 illustrates a bitstream processing flow in mbcore
for a low delay mode operation.

DETAILED DESCRIPTION OF THE 
INVENTION 

The requirements for supporting a two frame architecture
as well as letterbox scaling are as follows, using NTSC.
Letterbox scaling only transmits 3/4 of a full screen, leaving
the top and bottom eighth of the screen blank at all times.
For letterbox scaling, a total of 360 (or 3/4*480) lines of
active video must be displayed. For a two frame store
system, with a 45 by 30 macroblock picture, the time
available is 360 lines of active video divided by 30*525 lines
per second, so approximately 0.02286 seconds are available to
decode the 45 by 30 macroblock arrangement. With 30 rows of
macroblocks, the time to decode one full row of macroblocks
is (360/(30*525))/30 seconds, or approximately
761.91 microseconds. The time to decode one macroblock is
761.91/45 or 16.93 microseconds. With two frame store,
double decoding is necessary, and the time available to
decode a macroblock is 16.93/2 microseconds, or 8.465
microseconds.
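
The arithmetic above can be reproduced with a short Python sketch. The constant and function names are hypothetical, and the default of 429 cycles per macroblock is taken from the worst case figure stated in the decoder timing discussion of this description; it is a parameter here, not something the sketch derives.

```python
NTSC_LINES_PER_FRAME = 525   # total scan lines per NTSC frame
NTSC_FRAMES_PER_SECOND = 30  # nominal frame rate used in the text
FULL_ACTIVE_LINES = 480      # active lines of a full-height picture
MB_COLUMNS, MB_ROWS = 45, 30 # macroblocks per row, rows per picture


def macroblock_budget_seconds(letterbox_fraction=0.75, decode_passes=2):
    """Time available to decode one macroblock, in seconds."""
    active_lines = FULL_ACTIVE_LINES * letterbox_fraction  # 360 lines
    picture_time = active_lines / (NTSC_FRAMES_PER_SECOND * NTSC_LINES_PER_FRAME)
    row_time = picture_time / MB_ROWS          # one row of macroblocks
    macroblock_time = row_time / MB_COLUMNS    # one macroblock, single pass
    return macroblock_time / decode_passes     # two frame store decodes twice


def minimum_clock_mhz(cycles_per_macroblock=429):
    """Clock needed to fit the given cycle count into the budget, in MHz."""
    return cycles_per_macroblock / macroblock_budget_seconds() / 1e6


if __name__ == "__main__":
    print(round(macroblock_budget_seconds() * 1e6, 3), "microseconds")
    print(round(minimum_clock_mhz(), 2), "MHz")
```

Because the sketch carries full precision instead of rounding at each step, its results differ from the figures in the text only in the last digit: roughly 8.47 microseconds per macroblock and a minimum clock a little under 50.7 MHz.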
Decoder Architecture 

FIG. 1 illustrates the MPEG video decoder 100 according
to the current invention. The system passes the compressed
bitstream 101 to MBCORE 102 (Macro Block core), which
passes data to TMCCORE 103 (Transformation/Motion
Compensation core) and Reference Subsystem 104. TMCCORE
103 passes information to MBCORE 102, and produces
reconstructed macroblocks.

The MBCORE 102 operates as both a controller and a
parser. The primary function of the MBCORE 102 is to parse the
compressed bitstream 101 and generate DCT coefficients
and motion vectors for all macroblocks. The DCT coefficients
are then passed to the TMCCORE 103 for further
processing, and the MBCORE 102 passes the motion vectors
to the Reference Subsystem 104 for further processing.

The MBCORE 102 comprises video bitstream symbol
extractor 105 and state machines 106. MBCORE 102 reads
the compressed bitstream 101 and if the compressed bitstream
is in VLC (Variable Length Coding), the MBCORE
decompresses the bitstream using the video bitstream symbol
extractor 105, detailed below. The MBCORE further
comprises DCT processor 107, which enables the MBCORE
102 to calculate and provide DCT coefficients to the TMCCORE
103 and motion vectors to the Reference Subsystem
104.

The TMCCORE 103 receives DCT and motion vector 
information for a series of macroblocks and performs the 
inverse discrete cosine transform for all data received. The
TMCCORE 103 receives the discrete cosine transform data
from the MBCORE 102, computes the inverse discrete 
cosine transform (IDCT) for each macroblock of data, 
computes a motion vector difference between the current 
frame and the reference frame by essentially "backing out" 
the difference between the current frame and reference 
frame, and combines this motion vector difference with the 
IDCT coefficients to produce the new frame using motion 
compensation. The TMCCORE 103 also executes pel com- 
pensation on reference data received from the Reference 
Subsystem 104, and reconstructs the new frame using infor- 
mation from the Reference Subsystem 104 and the 
MBCORE 102. 

The Reference Subsystem 104 receives motion vectors 
from the MBCORE 102. The Reference Subsystem 104 
determines the location of necessary motion related 
information, such as previous frame data and current frame 
data, to support the TMCCORE 103 in compensation and 
reconstruction. The Reference Subsystem 104 acquires such 
information and provides it to the TMCCORE 103. 

As noted above, the timing for performing the necessary 
parsing, coefficient generation, transmission, and picture 
reconstruction functions is critical. Data is transmitted to the 
MBCORE 102 as follows: a slice header and macroblock 
data passes to the MBCORE 102, followed by the DCT 
coefficient data for a particular macroblock of data. The slice 
header and macroblock data take 30 cycles for transmission, 
and thus the MBCORE does not transmit DCT data for 30 
cycles. Transmission of one macroblock of data requires the 
initial 30 cycle period, followed by six 64 cycle 
transmissions, and then the procedure repeats. 

The MBCORE 102 takes 50 cycles to parse the video 
bitstream from the slice start code, i.e. a data block indicat- 
ing the beginning of a particular bitstream arrangement, to 
generating the first coefficients for the IQ stage of the 
TMCCORE 103. 

Operation of the MBCORE is as follows. The MBCORE 
initially accepts and parses the 50 cycles up to the block 
layer. The MBCORE then generates one DCT coefficient per 
cycle, and takes a total of (64+1)*5+64 cycles, or 389 cycles,
to generate all the DCT coefficients for a given macroblock. 
The MBCORE passes a total of 384 DCT coefficients (64* 6) 
to the TMCCORE 103, which accepts one block of coeffi- 
cient data into IDCT Stage 1. 
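
As a quick check of the coefficient timing above, the following Python sketch counts the cycles for six blocks of 64 coefficients with a one cycle gap between consecutive blocks. The function name and its parameters are hypothetical and serve only to illustrate the arithmetic.

```python
def coefficient_cycles(blocks=6, coefficients_per_block=64, gap_cycles=1):
    """Return (coefficients passed, cycles taken) for one macroblock."""
    coefficient_total = blocks * coefficients_per_block  # one per cycle
    gaps = (blocks - 1) * gap_cycles                     # gaps between blocks
    return coefficient_total, coefficient_total + gaps


if __name__ == "__main__":
    coefficients, cycles = coefficient_cycles()
    print(coefficients, cycles)
```

The two values correspond to the 384 coefficients (64*6) and the (64+1)*5+64 = 389 cycles given in the text.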

A detailed illustration of the TMCCORE is presented in 
FIG. 2. After a full block of IDCT coefficient data passes 
through the IDCT Stage 1 data path, which can conceptually 
be analogized to a pipeline, IDCT Stage 2 computation 
begins on the IDCT Stage 1 processed data. Hence IDCT 
Stage 1 data is stored by the system in RAM and the IDCT 
Stage 1 data is subsequently received by IDCT Stage 2 
within the TMCCORE 103. IDCT Stage 1 operates as soon 
as it receives the data from the MBCORE 102. IDCT Stage 
2, however, is one block delayed due to the processing, 
storage, and retrieval of the IDCT data. The arrangement of 
the timing of the IDCT stages and the transmission of data 
within the TMCCORE 103 are presented below. 
Data Transmission Method 

FIG. 3 presents the timing diagram for the transmission of
data through the TMCCORE 103. From FIG. 3, the zero 
block of data, comprising 64 units of data and taking 64 
cycles, is processed in the IQ/IDCT Stage 1 pipeline initially.
A gap occurs between the six 64 blocks of data, taking
one cycle. The one block of data is subsequently processed
by the IQ/IDCT Stage 1 pipeline at the time the IDCT Stage
2 processes the zero block data. Processing continues in a
staggered manner until the four block is processed in IDCT
Stage 1 and the three block in IDCT Stage 2, at which time
the system begins reconstruction of the picture.

With the 4:2:0 ratio, the TMCCORE 103 receives four
luminance pixels and two chrominance pixels. At the end of
the four luminance pixels, the TMCCORE 103 initiates
reconstruction of the picture.

Total time for the process is 64 cycles multiplied by 6
blocks = 384 cycles, plus five one cycle gaps, plus the 35
cycles for header processing, plus a trailing five cycles to
complete reconstruction, for a total of 429 cycles.
Reconstruction takes 96 cycles.

The staggered timing arrangement for processing the data
permits the functions of the MBCORE 102 and TMCCORE
103 to overlap. This overlap permits the MBCORE 102 to
operate on one macroblock of data while the TMCCORE
103 operates on a second macroblock. Prior systems
required full loading of a single macroblock of data before
processing the data, which necessarily slowed the system
down and would not permit two-frame store and letterbox
scaling.

FIG. 4 shows the MBCORE/TMCCORE macroblock
decoding overlap scheme. Again, header data is received by
the MBCORE 102, followed by zero block data, which are
passed to IQ/IDCT Stage 1 processing. TMCCORE IDCT
Stage 2 subsequently processes the zero block data, at the
same time IQ/IDCT Stage 1 processes one block data. The
staggered processing progresses into and through the
reconstruction stage. During reconstruction, the five block is
received and processed in IDCT Stage 2, at which time the
MBCORE begins receipt of data from the subsequent
macroblock. Five block and picture reconstruction completes, at
which time zero block for the subsequent macroblock is
commencing processing within IQ/IDCT Stage 1. This is the
beneficial effect of overlapping processing.

In order to perform full merged store processing, wherein
the IDCT data and the motion vector data is merged within
the TMCCORE 103, both sets of data must be synchronized
during reconstruction. From the drawing of FIG. 4, the
motion vector data is received at the same time the IDCT
Stage 2 data is received and processed. The sum of the IDCT
Stage 2 data and the motion vector data establishes the
picture during reconstruction, and that picture is then
transmitted from the TMCCORE 103.

The total number of cycles required to decode the video
bitstream from the slice header and ship out six blocks of
coefficients is 429 cycles. The TMCCORE IDCT Stage 2
and Reconstruction takes fewer cycles than the MBCORE
parsing and shipping of data. With the staggered processing
arrangement illustrated above, the MPEG video processor
illustrated here can decode the bitstream in 429 cycles
(worst case).

From the requirements outlined above for the letterbox
format and two frame store, the minimum frequency at
which the MBCORE 102 and the TMCCORE 103 must
operate at to achieve real time video bitstream decoding is
1/(8.465 microseconds/429 cycles), or 50.67 MHz. Thus by
overlapping the decoding of the macroblocks using the
invention disclosed herein, the MBCORE and the TMCCORE
together can perform MPEG-2 MP/ML decoding
with a two frame store architecture and letterbox decoding
with a clock running at 54 MHz.

Video Bitstream Symbol Extractor/Data Steering Logic

The decoder of FIG. 1 must have the ability to decode a
VLD (variable length DCT) in every clock cycle. The
MBCORE 102 receives one DCT coefficient per cycle, and
comprises in addition to an inverse DCT function a video
bitstream symbol extractor 105. Data in the bitstream is
compressed, and thus the MBCORE 102 must extract the
necessary symbols from the bitstream, which may vary in
size. The largest symbol which must be extracted is 32 bits
according to the MPEG standard. The data steering logic for
the video bitstream symbol extractor permits the MBCORE
102 to read the symbols irrespective of symbol size.

The MBCORE 102 receives compressed video data in a
linear fashion as illustrated in FIG. 5A. W0,0 represents
Word 0, bit 0, while W1,31 represents Word 1, bit 31, and
so forth. Time progresses from left to right, and thus the data
bitstream enters the video decoder from left to right in a
sequential manner as illustrated in FIG. 5A. As parsing is
performed, compressed data consumed by the system is
flushed out of the register and new data is shifted into the
register. This flushing of consumed data and maintenance of
unconsumed data is performed by the data steering logic.

FIG. 5B illustrates the appearance of the data after a 16 bit
symbol is consumed. The data comprising W0,0 ... 15 is
consumed by the system, leaving all other data behind. The
problem which arises is that upon consuming a 16 bit
symbol, the next symbol may be 30 bits in length, thereby
requiring excess storage beyond the 32 bit single word
length. The tradeoff between timing and space taken by
performing this shifting function is addressed by the data
steering logic.

Data steering logic is presented in FIG. 6. According to
the data steering logic, the CPU first instructs the data
steering logic to initiate data steering. Upon receiving this
initiation signal, the data steering logic loads 32 bit first flop
601 and 32 bit second flop 602 with 64 bits of data. The data
steering logic then resets the total_used_bits counter 603 to
zero and indicates that initialization is complete by issuing
an initialization ready signal to the CPU.

Once the MBCORE 102 begins receiving video data, state
machines 106 within the MBCORE 102 examine the value
coming across the data bus and consume some of the bits.
This value is called "usedbits" and is a six bit ([5:0]) bus.
The total number of used bits, total_used[5:0], is the sum of
total_used_bits[5:0] and usedbits[5:0]. Total_used_bits
are illustrated in FIG. 6 as flop 604. Bit usage via flop 604
and total_used_bits counter 603 is a side loop used to track
the status of the other flops and barrel shifter 605.

Data is sequentially read by the system and passed to the
barrel shifter, and subsequently passed to resultant data flop
608.

For example, the initial value of usedbits is 0. A
consumption of 10 bits, representing a 10 bit symbol, by the
state machines 106 yields a total_used_bits of 10. Hence
the total_used is 10. These 10 bits are processed using first
flop bank MUX 606 and loaded into barrel shifter 605.

total_used is a six bit wide bus. The range of values that
may be stored using total_used is from 0 to 63. When the
value of total_used_bits is greater than 63, the value of
total_used_bits wraps back around to zero.

When total_used is greater than 32 and less than or equal
to 63, first flop bank 601 is loaded with new data. When
total_used is greater than or equal to zero and less than 32,
the data steering logic loads second flop bank 602 with data.

Continuing with the previous example, the first 10 bit
symbol is processed by first flop bank MUX 606 and loaded
into barrel shifter 605, usedbits set to 10, total_used set to
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10, and total_bits__used set to 10. The next symbol may 
take 12 bits, in which case the system processes the 12 bit 
symbol using first flop bank MUX 606 and passes the data 
to barrel shifter 605. usedbits is set to 12, which is added to 
total_used_bits (10) in total__used„bits counter 603, yield- 5 
ing a total„used of 22. 

The next data acquired from RAM may be a large symbol, 
having 32 bits of length. Such a symbol spans both first flop 
601 and second flop 602, from location 23 in first flop 601 
through second flop 602 location 13. In such a situation, 1Q 
usedbits is 32, and the data is processed by first flop bank 
MUX 606 and second flop bank MUX 607. usedbits is set 
to 32, which is added to total_used__bits (22) in total_ 
used_bits counter 603, yielding a total_used of 54. 

With a total_used of 54, the system loads new data into 
first flop 601 and continues with second flop 602. 15 

Barrel shifter 605 is a 32 bit register, and thus the addition 
of the last 32 bit segment of processed data would fill the 
barrel shifter 605. Hence the data from barrel shifter 605 is 
transferred out of barrel shifter 605 and into resultant data 
flop 608. The 32 bits from first flop bank MUX 606 and 20 
second flop bank MUX 607 pass to barrel shifter 605. 

Continuing with the example, the next symbol may only 
take up one bit. In such a situation, used bits is one, which 

is added to total used_bits (54) yielding a total_used of 

55. The system processes the bit in second flop bank MUX 25 
607 and the processed bit passes to barrel shifter 605. 

The next symbol may again be 32 in length, in which case
data from the end of second flop 602 and the beginning of
first flop 601 is processed and passed into the barrel shifter
605. usedbits is 32, which is added to total_used_bits (55),
which sums to 87. However, the six bit size of the total_used
counter indicates a total of 23, i.e. the pointer in the barrel
register 605 is beyond the current 64 bits of data and is 23
bits into the next 64 bits of data.
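
The running totals in the example above can be reproduced with a short sketch. It assumes only what the text states: a six bit counter that accumulates usedbits and therefore wraps at 64. The function name and argument names are illustrative and do not come from the disclosed hardware.

```python
def accumulate_used_bits(symbol_lengths, counter_bits=6):
    """Return the running total_used value after each symbol.

    The counter is assumed to be `counter_bits` wide, so it wraps
    modulo 2**counter_bits (64 for the six bit counter in the text).
    """
    modulus = 1 << counter_bits
    total_used = 0
    history = []
    for usedbits in symbol_lengths:
        total_used = (total_used + usedbits) % modulus
        history.append(total_used)
    return history


# Symbol lengths from the worked example: 10, 12, 32, 1 and 32 bits.
print(accumulate_used_bits([10, 12, 32, 1, 32]))  # [10, 22, 54, 55, 23]
```

The last entry shows the wrap: 55 + 32 = 87, which a six bit counter holds as 23.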

With a value in excess of 32 bits, the single bit residing
in barrel shifter 605 passes to resultant data flop 608, and the
32 bits pass to barrel shifter 605. The system then sequentially
steps through all remaining data to process and pass
data in an efficient manner.

The operation of the process is illustrated graphically in
FIG. 7. The first and second flop banks are loaded in step 701
and the system is initialized in step 702. The system reads data
in step 703 and determines total_used in step 704. The
system then determines whether total_used_bits is greater
than 32 in step 705, and, if so, first flop bank is loaded with
new data in step 706. Step 707 determines whether total_used
is greater than or equal to 0 and less than 32. If so, step
708 loads the second flop bank with data.

As long as usedbits is not equal to zero, steps 704 through
708 are repeated. If the CPU initializes the data steering
logic in the middle of the operation, the process begins at
step 701.

The advantage of this implementation is that it is hardware
oriented and requires no interaction from a CPU or
microcontroller. Only a single shift register is used, which
provides significant area savings. The system obtains the
benefits of using the shift register as a circular buffer in that
the system uses total bits as a pointer into the shift register
and loads shifted data into the resultant data register 608.

IDCT Processor/Algorithm

The TMCCORE 103 performs the IDCT transform using
IDCT processor 107. The Inverse Discrete Cosine Transform
is a basic tool used in signal processing. The IDCT
processor 107 used in MBCORE 102 may be any form of
general purpose tool which performs the IDCT function, but
the preferred embodiment of such a design is presented in
this section.



The application of the IDCT function described in this 
section is within a real time, high throughput multimedia 
digital signal processing chip, but alternate implementations 
can employ the features and functions presented herein to 
perform the inverse DCT function. 

The implementation disclosed herein is IEEE compliant, 
and conforms with IEEE Draft Standard Specification for 
the Implementations of 8x8 Inverse Discrete Cosine 
Transform, P1180/D1, the entirety of which is incorporated 
herein by reference. 

Generally, as illustrated in FIG. 1, the MBCORE 102 
receives DCT data and initially processes symbols using the 
video bitstream symbol extractor 105 and subsequently 
performs the IDCT function using IDCT processor 107. 

The system feeds DCT coefficients into IDCT processor 
106 in a group of eight rows of eight columns. Each DCT 
coefficient is a 12 bit sign magnitude number with the most 
significant bit (MSB) being the sign bit. The IDCT processor 
106 processes a macroblock comprising an 8x8 block of 
pixels in 64 cycles. After processing, the IDCT processor 
transmits a data stream of eight by eight blocks. Each output 
IDCT coefficient is a nine bit sign magnitude number also 
having the MSB as a sign bit. 

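The Inverse Discrete Cosine Transform is defined as:

$$x(i,j) = \frac{1}{4} \sum_{k=0}^{7} \sum_{l=0}^{7} C(k)\, C(l)\, X(k,l) \cos\left[\frac{(2i+1)k\pi}{16}\right] \cos\left[\frac{(2j+1)l\pi}{16}\right] \qquad (1)$$

where i, j = 0 ... 7 index the pixel position, X(k,l), k, l = 0 ... 7 is
the transformed DCT coefficient, x(i,j) is the final
result, and

$$C(0) = \frac{1}{\sqrt{2}}, \quad \text{and} \quad C(i) = 1,\ i \neq 0. \qquad (2)$$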

Equation 1 is mathematically equivalent to the following 
matrix form: 



(3) 



where X_Q(i,j) = QQ(i,j) X(j,i), QQ = Q*Q, where Q is a
matrix and QQ is the product of matrix Q with itself. P
from Equation 3 is as follows:



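$$P = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
a & r(a+1) & r(a-1) & 1 & -1 & -r(a-1) & -r(a+1) & -a \\
b & 1 & -1 & -b & -b & -1 & 1 & b \\
c & -r(c-1) & -r(c+1) & -1 & 1 & r(c+1) & r(c-1) & -c \\
1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 \\
1 & -r(c+1) & r(c-1) & c & -c & -r(c-1) & r(c+1) & -1 \\
1 & -b & b & -1 & -1 & b & -b & 1 \\
1 & -r(a-1) & r(a+1) & -a & a & -r(a+1) & r(a-1) & -1
\end{bmatrix}$$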



where Q is: 



IvT V^TT V^TT V^Tf 
and I is a unitary diagonal identity matrix, a is
5.0273, b is 2.4142, c is 1.4966, and r is 0.7071.
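
As a plausibility check on the P matrix, the following sketch builds it from the quoted constants and compares each row with the cosine terms of Equation 1. It assumes that row k of P is a scaled copy of cos((2n+1)kπ/16) for n = 0 ... 7, with the scale taken from the first entry of the row. The helper names are illustrative only and are not part of the disclosure.

```python
import math

# Constants as quoted in the text.
a, b, c, r = 5.0273, 2.4142, 1.4966, 0.7071

P = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [a, r * (a + 1), r * (a - 1), 1, -1, -r * (a - 1), -r * (a + 1), -a],
    [b, 1, -1, -b, -b, -1, 1, b],
    [c, -r * (c - 1), -r * (c + 1), -1, 1, r * (c + 1), r * (c - 1), -c],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, -r * (c + 1), r * (c - 1), c, -c, -r * (c - 1), r * (c + 1), -1],
    [1, -b, b, -1, -1, b, -b, 1],
    [1, -r * (a - 1), r * (a + 1), -a, a, -r * (a + 1), r * (a - 1), -1],
]


def max_basis_error(matrix):
    """Largest gap between the rescaled rows and the cosine basis."""
    worst = 0.0
    for k, row in enumerate(matrix):
        # Scale chosen so the first entry matches cos(k*pi/16) exactly.
        scale = math.cos(k * math.pi / 16) / row[0]
        for n, entry in enumerate(row):
            target = math.cos((2 * n + 1) * k * math.pi / 16)
            worst = max(worst, abs(entry * scale - target))
    return worst


def distinct_magnitudes(row):
    """Number of different absolute values in one row."""
    return len({round(abs(value), 6) for value in row})


print(max_basis_error(P) < 1e-4)
print(max(distinct_magnitudes(row) for row in P))
```

The second figure counts the different magnitudes per row; with 1 as one of them, at most three need a true multiplication.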

The matrix representation of the IDCT greatly simplifies 
the operation of the IDCT processor 106, since each row of 
the P matrix has only four distinct entries, with one entry 
being 1. This simplification of the number of elements in the 
IDCT matrix means that in performing a matrix 
multiplication, the system only needs three multipliers 
instead of eight, the total number of elements in each row. 

The system performs IDCT processing by performing
multiplications as illustrated in FIG. 8. The IDCT processor
107 receives 12 bits of DCT data input in 2's complement
format, and thus can range (with the sign bit) from -2048 to
+2047. The first block 801 performs a sign change to convert
to sign magnitude. If necessary, block 801 changes -2048 to
-2047. This yields eleven bits of data and a data bit
indicating sign. Second block 802 performs the function
QX'Q, which uses 0+16 bits for QQ, yielding one sign bit
and 20 additional bits. Block 802 produces a 27 bit word
after the multiplication (11 bits multiplied by 16 bits), and
only the 20 most significant bits are retained. Block 803
multiplies the results of block 802 with the elements of the
P matrix, above. The P matrix is one sign bit per element and
15 bits per element, producing a 35 bit word. The system
discards the most significant bit and the 14 least significant
bits, leaving a total of 20 bits. The result of block 804 is
therefore again a one bit sign and 20 data bits.
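
The word-length bookkeeping in this paragraph can be illustrated with a small sketch. It works on unsigned magnitudes only, on the assumption that the sign bit is carried separately as the sign-magnitude description suggests. The function name is hypothetical.

```python
def keep_middle_bits(product, width, drop_msb, drop_lsb):
    """Keep the bits that remain after dropping MSBs and LSBs.

    `product` is an unsigned magnitude that fits in `width` bits.
    """
    kept = width - drop_msb - drop_lsb
    return (product >> drop_lsb) & ((1 << kept) - 1)


# Block 802: an 11 bit magnitude times a 16 bit factor gives 27 bits,
# of which the 20 most significant are retained (7 LSBs dropped).
largest_802 = (2**11 - 1) * (2**16 - 1)
print(largest_802.bit_length())                              # 27
print(keep_middle_bits(largest_802, 27, 0, 7).bit_length())  # 20

# Block 803: a 20 bit magnitude times a 15 bit factor gives 35 bits;
# dropping 1 MSB and 14 LSBs again leaves 20 bits.
largest_803 = (2**20 - 1) * (2**15 - 1)
print(largest_803.bit_length())                              # 35
```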

Block 805 converts the sign magnitude to two's 
complement, yielding a 21 bit output. The system adds four 
blocks into each buffer, with the buffers having 22 bits each. 
Block 805 transmits all 22 bits. Block 806 performs a sign 
change to obtain QX'QP, and passes 22 bits with no carry to 
block 807. 

Block 807 performs a matrix transpose of QX'QP, yielding
(QX'QP)^T. Block 807 passes this transpose data to block
808 which performs a two's complement to sign-magnitude
conversion, yielding a one bit sign and a 21 bit word. Block 809
clips the least significant bit, producing a one bit sign and a 20 bit
word. This result passes to block 810, which multiplies the
result by the P matrix, having a one bit sign and a 15 bit
word. The multiplication of a 20 bit word with 1 bit sign by
a 15 bit word with 1 bit sign yields a 35 bit word, and the
system discards the two most significant bits and the 13 least
significant bits, producing a 20 bit word with a 1 bit sign out
of block 810. The result of block 810 is sign-magnitude
converted back to 2's complement, producing a 21 bit result
in block 811. Block 812 performs a similar function to block
805, and adds the four products into each buffer. The buffers
have 22 bits each, and the output from block 812 is 22 bits.
This data is passed to block 813, which performs a sign
switch to obtain the elements of (QX'QP)^T P. Output from
block 813 is a 22 bit word, with no carry. Block 814 right
shifts the data seven bits, with roundoff rather than clipping.
In other words, the data appears as follows:

SIGNxxxxxxxxxxxxxXYxxxxxx (22 bit word)
and is transformed by a seven bit shift in block 814 to:
SIGNxxxxxxxxxxxxxX.Yxxxxxx
Depending on the value of Y, block 814 rounds off the
value to keep 15 bits. If Y is 1, block 814 increments the
integer portion of the word by 1; if Y is 0, block 814 does
not change the integer part of the word.

The result is a 15 bit word, which is passed to block 815. 
In block 815, if the 15 bit value is greater than 255, the block 
sets the value to 255. If the value is less than -256, it sets 
the value to -256. The resultant output from block 815 is the 
IDCT output, which is a 9 bit word from -256 to 255. This 
completes the transformation from a 12 bit DCT input
having a value between -2048 and 2047 to a 9 bit inverse
DCT output having a value between -256 and 255.
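
The seven bit shift with roundoff and the final limiting step can be sketched as below. The sketch assumes the rounding acts on the magnitude of the value, with the sign handled separately, which is one reading of the sign-magnitude description. The function name is illustrative.

```python
def shift_round_clip(value):
    """Right shift by seven bits with roundoff, then limit to -256..255."""
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    integer = magnitude >> 7       # integer part after the shift
    y = (magnitude >> 6) & 1       # first fractional bit (Y in the text)
    rounded = sign * (integer + y)
    return max(-256, min(255, rounded))


print(shift_round_clip(128))     # 1   (exactly 1.0)
print(shift_round_clip(64))      # 1   (0.5 rounds up because Y is 1)
print(shift_round_clip(63))      # 0   (Y is 0)
print(shift_round_clip(40000))   # 255 (limited)
print(shift_round_clip(-40000))  # -256 (limited)
```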

The efficiencies for matrix multiplication are as follows. 
The four factors used which can fully define all elements of 
the QQ and P matrices are as follows: 

llii 

h = - 



f =W*= 



V^ + l ' Vc2+ 1 



The parameters for all elements of the QQ and P matrices
are:

For the P matrix,

1      = 1        = 001000000000000
a      = 5.02734  = 101000001110000
b      = 2.41421  = 010011010100001
c      = 1.49661  = 001011111110010
r(a+1) = 4.26197  = 100010000110001
r(a-1) = 2.84776  = 010110110010001
r(c-1) = 0.351153 = 000010110100000
r(c+1) = 1.76537  = 001110001000000
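
The 15 bit patterns listed for the P matrix read naturally as fixed-point numbers with 3 integer bits and 12 fractional bits. That layout is an inference from the listed values, not something the text states, and the rounding mode used below (round to nearest) is likewise an assumption. The function name is hypothetical.

```python
def to_fixed_3_12(value):
    """Encode a non-negative value as 15 bits: 3 integer + 12 fractional."""
    return format(round(value * 4096), "015b")


print(to_fixed_3_12(1))        # 001000000000000
print(to_fixed_3_12(5.02734))  # 101000001110000
print(to_fixed_3_12(2.41421))  # 010011010100001
print(to_fixed_3_12(1.49661))  # 001011111110010
```

Under this assumption the sketch reproduces the listed patterns for 1, a, b and c. Some of the listed product entries differ from it in their low-order bits, so the original table may have been generated with a different rounding or from higher-precision constants.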



The entire IDCT is implemented in two stages. IDCT
Stage 1, illustrated in FIG. 9, implements X_Q P. The second
stage, illustrated in FIG. 10, transposes the result and
multiplies it by P again.

From FIG. 2, and as may be more fully appreciated from
the illustrations of FIGS. 8 through 11, the TMCCORE 103
receives the DCT input, produces the matrix (QX'Q)P, or
X_Q P, in IDCT Stage 1 (i.e., from FIG. 8, completes through
block 806) and stores the result in transpose RAM 923.
IDCT Stage 2 performs the transpose of the result of IDCT
Stage 1 and multiplies the result by P, completing the IDCT
process and producing the IDCT output.

As may be appreciated from FIG. 9, the representation
disclosed is highly similar to the flowchart of FIG. 8. From
FIG. 9, IDCT Stage 1 pipeline 900 receives data from the IQ
block in the form of the matrix X. The Q matrix is available
from a row/column state machine in the IQ pipeline,
depicted by state machine registers 902. The state machine
registers 902 pass data from register 902c to QQ matrix
block 903 which contains QQ matrix generator 904 and QQ
matrix register 905. QQ data is passed to QX'Q block 901
which multiplies the 16 bit QQ matrix by the X block having
one sign bit and 11 data bits in QX'Q multiplier 906. This
multiplication is passed to QX'Q register 907, which transmits
a one bit sign and a 20 bit word. QX'Q block 901
thereby performs the function of block 802. Output from
register 902d is a column [2:0] which passes to P matrix
block 908. P matrix block 908 comprises P matrix generator
909 which produces a sign bit and three fifteen bit words to
P matrix register 910.

QX'Q block 901 passes the one bit sign and 20 bit word
to (QX'Q)P block 911, which also receives the three fifteen
bit words and one sign bit from P matrix block 908. (QX'Q)P
block 911 performs the function illustrated in block 803 in
three multiplier blocks 912a, 912b, and 912c. The results of
these multiplications are passed to (QX'Q)P MUX 913, which
also receives data from register 902e in the form row[2:0].
Data from register 902e also passes to read address generator
914, which produces a transpose RAM read address. The
transpose RAM read address passes to transpose RAM 923
and to first write address register 915, which passes data to
write address register 916. The write address from write
address register 916 and the read address from read address
generator 914 pass to transpose RAM 923, along with the P
matrix read row/column generator state machine 1001, illustrated
below. (QX'Q)P MUX 913 thus receives the output
from the three multiplier blocks 912a, 912b, and 912c as
well as the output from register 902e, and passes data to
(QX'Q)P register 917, which passes the (QX'Q)P matrix in
a one bit sign and 20 bit word therefrom. As in block 804,
these four data transmissions from (QX'Q)P block 911 pass
to matrix formatting block 918. Matrix formatting block 918
performs first the function illustrated in block 802 by
converting sign-magnitude to two's complement in two's
complement blocks 919a, 919b, 919c, and 919d. The values
of these four blocks 919a-d are added to the current values
held in transpose RAM 923 in summation blocks 920a,
920b, 920c, and 920d. The transpose RAM 923 value is
provided via register 921. Transpose RAM 923 is made up
of 4 eight bit by 88 bit values, and each 22 bit result from
the four summation blocks 920a, 920b, 920c, and 920d passes
to register 922 and subsequently to transpose RAM 923.
This completes processing for IDCT Stage 1.

Processing for IDCT Stage 2 1000 is illustrated in FIG.
10. P matrix read row/column generator state machine 1001
receives a transpose RAM ready indication and provides
row/column information for the current state to transpose
RAM 923 and to a sequence of registers 1002a, 1002b,
1002c, 1002d and 1002e. The information from 1002b
passes to Stage 2 P matrix block 1003, comprising Stage 2
P matrix generator 1004 and P matrix register 1005, which
yields the one bit sign and 15 bit word for the P matrix.

From transpose RAM 923, two of the 22 bit transpose
RAM elements pass to transpose block 1006, wherein transpose
MUX 1007 passes data to registers 1008a and 1008b,
changes the sign from one register using sign change element
1009 and passes this changed sign with the original
value from register 1008b through MUX 1010. The value
from MUX 1010 is summed with the value held in register
1008a in summer 1011, which yields the transpose of
QX'QP, a 22 bit word. Thus the value of the data passing
from the output of summer 1011 is functionally equal to the
value from block 807, i.e. (QX'QP)^T. Two's complement/
sign block 1012 performs the function of block 808, converting
the two's complement to sign-magnitude. The LSB is
clipped from the value in LSB clipping block 1013, and this
clipped value is passed to register 1014, having a one bit sign
and a 20 bit word.

The output from transpose block 1006 is multiplied by the
P matrix as functionally illustrated in block 810. This
multiplication occurs in Stage 2 P multiplication block 1015,
specifically in multipliers 1016a, 1016b, and 1016c. This is
summed with the output of register 1002c in MUX 1017 and
passed to register 1018. This is a matrix multiplication
which yields (QX'QP)^T P. Conversion block 1019 converts
this information, combines it with specific logic and stores
the IDCT values. First two's blocks 1020a, 1020b, 1020c,
and 1020d convert sign-magnitude to two's complement, as
in block 811, and sum this in adders 1021a, 1021b, 1021c,
and 1021d with current IDCT RAM 1024 values, which
comprise four 22 bit words. The sum of the current IDCT
RAM values and the corrected (QX'QP)^T P values summed in
adders 1021a-d pass to IDCT RAM 1024.

IDCT RAM 1024 differs from transpose RAM 923. IDCT
RAM 1024 provides a hold and store place for the output of
IDCT Stage 2 values, and comprises two 88 by 1 registers.
Note that IDCT RAM 1024 feeds four 22 bit words back to
adders 1021a-d, one word to each adder, and passes eight 22
bit words from IDCT Stage 2 1000.

RAM also utilizes values passed from register 1002d, i.e.
the position of read/write elements or the state of the
multiplication. Register 1002d passes data to read additional
combined logic element 1022, which calculates and passes
a read add indication and a write add indication to RAM to
properly read and write data from adders 1021a-d.

Data also passes from register 1002d to register 1002e,
which provides information to output trigger generator
1023, the result of which is passed to RAM as well as out
of IDCT Stage 2 1000. The output from RAM is eight 22 bit
words and the output from output trigger generator 1023.
The result functionally corresponds to the output from block
812.

FIG. 11 illustrates the implementation which performs the
final functions necessary for IDCT output and stores the
values in appropriate positions in IDCT OUTPUT RAM
1115. Sign corrector 1101 receives the eight 22 bit words
from IDCT Stage 2 1000 and multiplexes them using MUX
1102 to four 22 bit words passing across two lines. These
values are summed in summer 1103, and subtracted in
subtractor 1104 as illustrated in FIG. 11. The output from
subtractor 1104 passes through register 1105 and reverse
byte orderer 1107, and this set of 4 22 bit words passes along
with the value from summer 1103 to MUX 1107, which
passes data to register 1108. This sign corrector block
produces an output functionally comparable to the output of
block 813, essentially providing the elements of (QX'QP)^T P.
Shift/roundoff block 1109 takes the results from sign corrector
1101, converts two's complement to sign/magnitude
in element 1110, shifts the value right seven places using
shifters 1111a, 1111b, 1111c and 1111d, rounds these values
off using round off elements 1112a, 1112b, 1112c, and 1112d,
and passes these to element 1113. The rounded off values
from round off elements 1112a-d functionally correspond to
the output from block 814. The value is limited between
-256 and +255 in element 1113, the output of which is a 15
bit word passed to sign block 1114, which performs a
conversion to two's complement and passes four nine bit
words to IDCT OUTPUT RAM 1115.

Output from the Output Trigger Generator and the
chroma/luma values from CBP Luma/Chroma determine the
stage of completeness of the IDCT RAM OUTPUT. IDCT
RAM address/IDCT Done indication generator 1116, as with
elements 914, 915, and 916, as well as elements 1022 and
1023, are placekeepers or pointers used to keep track of the
position of the various levels of RAM, including the current
position and the completion of the individual tasks for
various levels of processing, i.e. IDCT Stage 1 progress,
IDCT Stage 2 progress, and completion of the Stages. It is
recognized that any type of bookkeeping, maintenance, or
pointing processing can generally maintain values and
placement information for reading, writing, and providing
current location and completion of task indications to blocks
or elements within the system while still within the scope of
the current invention. The purpose of these elements is to
provide such a bookkeeping function.

IDCT RAM address/IDCT Done indication generator
1116 receives output trigger generator 1023 output trigger
information and CBP Luma/Chroma indications and provides
a write address and a Luma Done/Chroma Done IDCT
indication, signifying, when appropriate, the receipt of all
necessary luma/chroma values for the current macroblock.

The system writes IDCT information to IDCT OUTPUT
RAM 1115, specifically the information passing from sign
block 1114 to the appropriate location based on the write
address received from IDCT RAM address/IDCT Done
indication generator 1116. IDCT OUTPUT RAM 1115 is
broken into Luma (Y0, Y1, Y2, and Y3) locations, and
Chroma (Cb and Cr) locations. The values of IDCT OUTPUT
RAM 1115 represent the complete and final IDCT
outputs.

The design disclosed herein provides IDCT values at the
rate of 64 cycles per second. The design stores two blocks
worth of data in transpose RAM 923 between IDCT Stage
1 and IDCT Stage 2.

Motion Compensation

Motion compensation for the two frame store and letterbox
scaling for MPEG decoding operates as follows.

For a 2x7 array of pixels, i.e. 14 pels, the numbering of
pels is illustrated in FIG. 12.

The system performs a half-pel compensation. Half-pel
compensation is compensating for a location between pixels,
i.e. the motion is between pixel x and pixel y. When the
system determines the data in FIG. 12 must be right half pel
compensated, or shifted right one half pel, the system
performs the operation(s) outlined below.

0' = (0+1)/2; if (0+1) mod 2 = 1, 0' = 0'+1;
1' = (1+2)/2; if (1+2) mod 2 = 1, 1' = 1'+1; ...
5' = (5+6)/2; if (5+6) mod 2 = 1, 5' = 5'+1.

When the system determines the data in FIG. 12 must be
down half pel compensated, or shifted downward one half
pel, the system performs the operation(s) outlined below.

0' = (0+7)/2; if (0+7) mod 2 = 1, 0' = 0'+1;
1' = (1+8)/2; if (1+8) mod 2 = 1, 1' = 1'+1; ...
6' = (6+13)/2; if (6+13) mod 2 = 1, 6' = 6'+1.

Alternately, the system may indicate the desired position
is between four pels, or shifted horizontally one half pel and
down one half pel. When the system determines the data in
FIG. 12 must be right and down half pel compensated, or
shifted right one half pel and down one half pel, the system
performs the operation(s) outlined below.

0' = (0+1+7+8)/4; if (0+1+7+8) mod 4 = 1, 0' = 0'+1;
1' = (1+2+8+9)/4; if (1+2+8+9) mod 4 = 1, 1' = 1'+1.

The aforementioned logic is implemented as illustrated in
FIG. 13. As may be appreciated, a right half pel shift may
require the system to point to a position one half-pel outside
the block. Thus the system must compensate for odd-pel
shifting.

From FIG. 13, the motion compensation unit 1300 comprises
horizontal half pel compensator 1301 and vertical
half pel compensator 1302, as well as four banks of 36 flops
1303a, 1303b, 1303c, and 1303d. Registers 1304a, 1304b,
1304c, 1304d, and 1304e contain motion compensation data
having 32 bits of information. These registers pass the
motion compensation data to horizontal compensation
MUXes 1305a, 1305b, 1305c, and 1305d as well as horizontal
adders 1306a, 1306b, 1306c, and 1306d as illustrated
in FIG. 13. For example, register 1304e passes motion
compensation data to horizontal compensation MUX 1305d
which subsequently passes the information to horizontal
adder 1306d and adds this value to the value received from
register 1304d. Register 1304a passes data to adder 1306a
but does not pass data to any of the horizontal compensation
MUXes 1305a-d. This summation/MUX arrangement provides
a means for carrying out the right half-pel compensation
operations outlined above. The result of the horizontal
half pel compensator 1301 is four summed values corresponding
to the shift of data one half pel to the right for a
row of data.

As a luma macroblock has dimensions of 16x16, movement
of one half pel to the right produces, for the 16th
element of a row, a shift outside the bounds of the 16x16
macroblock. Hence a right shift produces a 16x17 pixel
macroblock, a vertical shift a 17x16 pixel macroblock, and
a horizontal and vertical shift a 17 by 17 pixel macroblock.
The additional space is called an odd pel.

The compensation scheme illustrated in FIG. 13 determines
the necessity of compensation and thereby instructs
the MUXes disclosed therein to compensate by adding one
half pel to each pel position in the case of horizontal pixel
compensation. Thus out of the 32 bits from reference logic,
data for each pel may be shifted right one pel using the
MUX/adder arrangement of the horizontal half pel compensator
1301.
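
The right and down half-pel operations listed earlier in this section can be sketched in software as follows. This is an illustration of the arithmetic only, not of the MUX/adder hardware of FIG. 13. The pel values are hypothetical, laid out like the 2x7 array of FIG. 12, and the function names are not taken from the disclosure.

```python
def average_round_up(p, q):
    """(p+q)/2, incremented by one when the sum is odd, as in the text."""
    result = (p + q) // 2
    if (p + q) % 2 == 1:
        result += 1
    return result


def right_half_pel(rows):
    """Average each pel with its right-hand neighbour (one fewer column)."""
    return [[average_round_up(row[i], row[i + 1]) for i in range(len(row) - 1)]
            for row in rows]


def down_half_pel(rows):
    """Average each pel with the pel directly below it (one fewer row)."""
    return [[average_round_up(top, bottom)
             for top, bottom in zip(rows[r], rows[r + 1])]
            for r in range(len(rows) - 1)]


# Hypothetical pel values: pels 0..6 in the first row, 7..13 in the second.
pels = [[10, 11, 13, 13, 20, 21, 30],
        [12, 12, 14, 18, 20, 25, 31]]

print(right_half_pel(pels)[0])  # [11, 12, 13, 17, 21, 26]
print(down_half_pel(pels)[0])   # [11, 12, 14, 16, 20, 23, 31]
```

A row of seven pels yields six right-shifted results, which is why a 16 pel wide block needs a seventeenth (odd) pel for this case.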

Vertical pel compensation operates in the same manner.
For each of the pels in a macroblock, the data is shifted
downward one half pel according to the vertical compensation
scheme outlined above. Vertical half pel compensator
1302 takes and sums results from the horizontal half pel
compensator 1301 and receives data from the four banks of
36 flops 1303a, 1303b, 1303c, and 1303d. Data from horizontal
half pel compensator 1301 passes to vertical adders
1308a, 1308b, 1308c, and 1308d along with MUXed data
from the four banks of 36 flops 1303a, 1303b, 1303c, and
1303d.

In cases where vertical and horizontal half pel compensation
are required, the four banks of 36 flops 1303a, 1303b,
1303c, and 1303d are used by the system to store the extra
row of reference data expected for down half-pel compensation.
This data storage in the four banks of 36 flops
1303a-d provides the capability to perform the computations
illustrated above to vertically and horizontally shift the
data one half pel. The result is transmitted to register 1309,
which may then be B-picture compensated and transmitted
to motion compensation output RAM 1311.

Reference data averaging may be necessary for B-pictures
having backward and forward motion vectors, or with P
pictures having a dual-prime prediction. Either function is
accomplished within the B-picture compensator 1310.

Prediction may generally be either frame prediction, field 
prediction, or dual-prime. Frame pictures for half pel com- 
pensation appear as follows. 

In frame prediction, the luma reference data pointed to by
a motion vector contains either 16x16 (unshifted), 16x17
(right half-pel shifted), 17x16 (down half-pel shifted), or
17x17 (right and down half-pel shifted) data. The chroma
component, either Cr or Cb, contains either 8x8 (unshifted),
8x9 (right half-pel shifted), 9x8 (down half-pel shifted) or
9x9 (right and down half-pel shifted) data.

In field prediction as well as dual-prime predictions, the
luma reference data pointed to by a motion vector contains
either 8x16 (unshifted), 8x17 (right half-pel shifted), 9x16
(down half-pel shifted) or 9x17 (down and right half-pel
shifted) data. The chroma reference data, either Cr or Cb,
contains either 4x8 (unshifted), 4x9 (right half-pel shifted),
5x8 (down half-pel shifted) or 5x9 (right and down half-pel
shifted) data.

Field pictures for half-pel compensation may utilize field
prediction, 16x8 prediction, or dual-prime. Field prediction
and dual-prime prediction are identical to frame prediction
in frame pictures, i.e. the luma and chroma references are as
outlined above with respect to frame prediction (16x16,
16x17, 17x16, or 17x17 luma, 8x8, 8x9, 9x8, or 9x9
chroma). 16x8 prediction is identical to field prediction in
frame pictures, i.e., luma and chroma are identical as outlined
above with respect to field prediction (8x16, 8x17,
9x16, or 9x17 luma, 4x8, 4x9, 5x8, or 5x9 chroma).

The motion compensation unit 1300 accepts reference
data 32 bits (4 pels) at a time while accepting odd pel data
one pel at a time on the odd pel interface. The system ships
luma reference data in units of 8x16 and chroma reference
data in units of 4x8. Luma reference data is transferred
before chroma reference data, and Cb chroma is shipped
before Cr chroma.

In accordance with the motion compensation unit 1300 of
FIG. 13, transfer of luma and chroma data occurs as follows.
For luma data, assume that the luma reference data is
represented by Luma [8:0] [16:0], i.e. that the data requires both
right and down half-pel compensation. On a cycle by cycle
basis, luma data is transferred as follows using motion
compensation unit 1300:



Cycle    Reference Data       Odd-Pel Data

1        Luma [0] [12:15]     Luma [0] [16]
2        Luma [0] [8:11]
3        Luma [0] [4:7]
4        Luma [0] [0:3]
5        Luma [1] [12:15]     Luma [1] [16]
6        Luma [1] [8:11]
7        Luma [1] [4:7]
8        Luma [1] [0:3]
...
33       Luma [8] [12:15]     Luma [8] [16]
34       Luma [8] [8:11]
35       Luma [8] [4:7]
36       Luma [8] [0:3]





For chroma reference data represented by Chroma [4:0]
[8:0], the motion compensation unit 1300 transfers data on
a cycle by cycle basis as follows:

Cycle    Reference Data       Odd-Pel Data

1        Chroma [0] [4:7]     Chroma [0] [8]
2        Chroma [0] [0:3]
3        Chroma [1] [4:7]     Chroma [1] [8]
4        Chroma [1] [0:3]
...
9        Chroma [4] [4:7]     Chroma [4] [8]
10       Chroma [4] [0:3]
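
The cycle-by-cycle ordering for luma and chroma can be generated with a short sketch. It assumes that every row is sent as groups of four pels, highest indices first, and that the odd pel of each row sits at the column index just past the regular pels (16 for luma, 8 for chroma) and accompanies the first group of that row. The function name and tuple layout are illustrative only.

```python
def transfer_schedule(name, rows, cols):
    """List (cycle, reference_data, odd_pel_data) tuples.

    `cols` counts the regular pels per row (16 for luma, 8 for chroma);
    the extra pel at index `cols` travels on the odd pel interface.
    """
    schedule = []
    cycle = 1
    for row in range(rows):
        # Groups of four pels are sent highest indices first.
        for start in range(cols - 4, -1, -4):
            reference = f"{name} [{row}] [{start}:{start + 3}]"
            odd_pel = f"{name} [{row}] [{cols}]" if start == cols - 4 else ""
            schedule.append((cycle, reference, odd_pel))
            cycle += 1
    return schedule


luma = transfer_schedule("Luma", rows=9, cols=16)
chroma = transfer_schedule("Chroma", rows=5, cols=8)
print(len(luma), len(chroma))  # 36 10
print(luma[32])                # (33, 'Luma [8] [12:15]', 'Luma [8] [16]')
```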





Data expected by motion compensation units for the 
combinations of picture type, prediction type, and pel com- 
pensation are as follows: 



Picture   Prediction    Pel             Data fetched by vector
Type      Type          Compensation    (in pels) Luma/Chroma

Frame     Frame         None            16 x 16/8 x 8
                        Right           16 x 17/8 x 9
                        Vertical        17 x 16/9 x 8
                        Right/Vert.     17 x 17/9 x 9

          Field         None            8 x 16/4 x 8
                        Right           8 x 17/4 x 9
                        Vertical        9 x 16/5 x 8
                        Right/Vert.     9 x 17/5 x 9

          Dual-Prime    None            8 x 16/4 x 8
                        Right           8 x 17/4 x 9
                        Vertical        9 x 16/5 x 8
                        Right/Vert.     9 x 17/5 x 9

Field     Field         None            16 x 16/8 x 8
                        Right           16 x 17/8 x 9
                        Vertical        17 x 16/9 x 8
                        Right/Vert.     17 x 17/9 x 9

          16 x 8        None            8 x 16/4 x 8
                        Right           8 x 17/4 x 9
                        Vertical        9 x 16/5 x 8
                        Right/Vert.     9 x 17/5 x 9

          Dual-Prime    None            16 x 16/8 x 8
                        Right           16 x 17/8 x 9
                        Vertical        17 x 16/9 x 8
                        Right/Vert.     17 x 17/9 x 9
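
The table of expected data sizes follows a simple pattern: a base block size chosen by picture type and prediction type, plus one extra column for right compensation and one extra row for vertical compensation. A minimal sketch of that pattern is given below. The lowercase string keys and the function name are illustrative choices and do not appear in the disclosure.

```python
# (picture type, prediction type) pairs whose luma base block is 16 rows high.
FULL_HEIGHT = {("frame", "frame"), ("field", "field"), ("field", "dual-prime")}


def fetch_size(picture, prediction, right=False, vertical=False):
    """Return ((luma_rows, luma_cols), (chroma_rows, chroma_cols))."""
    luma_rows = 16 if (picture, prediction) in FULL_HEIGHT else 8
    chroma_rows = luma_rows // 2
    extra_row = 1 if vertical else 0
    extra_col = 1 if right else 0
    return ((luma_rows + extra_row, 16 + extra_col),
            (chroma_rows + extra_row, 8 + extra_col))


print(fetch_size("frame", "frame"))                              # ((16, 16), (8, 8))
print(fetch_size("frame", "field", right=True, vertical=True))   # ((9, 17), (5, 9))
print(fetch_size("field", "16x8", right=True))                   # ((8, 17), (4, 9))
```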



Reference data transfer to the TMCCORE 103 occurs as
follows.

Luma Data

Reference Motion    Transfer Order to
Vector Data         Motion Compensation Unit 1300

17 x 17             1) 9 x 17
                    2) 8 x 17
16 x 16             1) 8 x 16
                    2) 8 x 16
17 x 16             1) 9 x 16
                    2) 8 x 16
16 x 17             1) 8 x 17
                    2) 8 x 17
8 x 16              8 x 16
9 x 16              9 x 16
8 x 17              8 x 17
9 x 17              9 x 17




Chroma Data

Reference Motion    Transfer Order to
Vector Data         Motion Compensation Unit 1300

9 x 9               1) 5 x 9
                    2) 4 x 9
8 x 9               1) 4 x 9
                    2) 4 x 9
9 x 8               1) 5 x 8
                    2) 4 x 8
8 x 8               1) 4 x 8
                    2) 4 x 8
4 x 8               4 x 8
4 x 9               4 x 9
5 x 8               5 x 8
5 x 9               5 x 9
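
The transfer order in the luma and chroma tables can be summarised as a row split: a full-height block is sent as two pieces, the first carrying the extra row when one is present, while field-sized blocks are sent whole. The sketch below assumes that each piece keeps the column count of the fetched block. The function name and the "luma"/"chroma" labels are illustrative.

```python
def transfer_order(rows, cols, component):
    """Split a reference block into the pieces sent to the unit, in order."""
    full_height = 16 if component == "luma" else 8
    if rows < full_height:
        return [(rows, cols)]            # field-sized data goes in one piece
    half = full_height // 2
    first = rows - half                  # 9 (or 5) rows when an extra row exists
    return [(first, cols), (half, cols)]


print(transfer_order(17, 17, "luma"))   # [(9, 17), (8, 17)]
print(transfer_order(16, 16, "luma"))   # [(8, 16), (8, 16)]
print(transfer_order(9, 9, "chroma"))   # [(5, 9), (4, 9)]
print(transfer_order(9, 17, "luma"))    # [(9, 17)]
```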



The maximum amount of reference data (in bytes) that the
system must fetch for any macroblock conforming to the
4:2:0 format occurs in a frame picture/field prediction/B-picture,
a field picture/16x8 prediction/B-picture, or a frame
picture/dual prime. The amount of luma reference data
expected, excluding odd pel data, is 4*9*16 or 576 bytes of
data. The amount of chroma reference data (for both Chroma
blue and Chroma red), excluding half-pel data, is 2*4*5*8 or
320 bytes.

Data may be processed by the motion compensation unit
1300 at a rate of 4 pels per cycle. The total number of cycles
required to process the data is (576+320)/4, or 224 cycles.
This does not include odd pel data which is transferred on a
separate bus not shared with the main data bus.
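
The worst-case cycle count follows directly from the byte counts given above. The short sketch below repeats the arithmetic, assuming one byte per pel; the variable names are illustrative.

```python
PELS_PER_CYCLE = 4

luma_bytes = 4 * 9 * 16        # 576, excluding odd pel data
chroma_bytes = 2 * 4 * 5 * 8   # 320, for Cb and Cr together
cycles = (luma_bytes + chroma_bytes) // PELS_PER_CYCLE

print(luma_bytes, chroma_bytes, cycles)  # 576 320 224
```
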
Motion Vector Logic Sharing and Pipelining 

In order to implement an MPEG video decoder that can
meet constraints of a 2-frame store architecture and a
letterbox scaling simultaneously, it is imperative that the
motion vectors be decoded in the worst case in a minimum
number of cycles.

In the following sections, an architecture of a motion 
vector extraction scheme that minimizes logic and achieves 
worst case motion vector decoding in 22 cycles per mac- 
roblock is disclosed. The implementation achieves the 
decoding by overlapping the data extraction process or the 
data fetch process with the computation process. 

Advantageously, the motion vector extraction and com- 
putation are performed in the least number of cycles without 
increasing hardware logic. The architecture requires no 
intervention from the CPU or microcontroller and enables 
2-frame store decode with letter box scaling capability. 
Motion Vector

In most decoding scenarios, a macroblock in a picture can
be reconstructed by using information from a macroblock in
a previous picture. A macroblock is a 16x16 pixel area
within the picture. Motion vectors encode the reference the
macroblock being decoded makes to a macroblock in a
previous picture. FIG. 14 illustrates this concept.

In FIG. 14, the current macroblock 1402 shows the 
position of the macroblock being decoded but in the reference 
picture 1404. The vector r is the motion vector whose 
head is the top left corner of the current macroblock 1402 
and the tail is the top left corner of the reference macroblock 
1406 in the reference picture 1404. Therefore vector r 
signifies a relative position of the reference macroblock 
1406 with respect to the current macroblock 1402. The top 
left corner of the current macroblock 1402 is identified by 
the coordinates (x,y). The vector f is the vector pointing to 
the top left corner of the reference macroblock 1406 from 
the top left corner of the reference picture 1404. It is 
obtained by adding the macroblock coordinates to the vector 
r. 
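
The relationship between the vectors of FIG. 14 can be illustrated with a short sketch. The function name and the sample coordinates are hypothetical, and whole-pixel units are assumed for simplicity. It shows only that f is the sum of the macroblock coordinates (x,y) and the motion vector r.

```python
def reference_position(mb_xy, r):
    """Return f, the top left corner of the reference macroblock,
    measured from the top left corner of the reference picture."""
    x, y = mb_xy
    rx, ry = r
    return (x + rx, y + ry)


# Hypothetical example: current macroblock at (32, 48), motion vector (-5, 3).
f = reference_position((32, 48), (-5, 3))
print(f)  # (27, 51)
```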

Differential Encoding of Motion Vector 

In MPEG2, the motion vector for the current macroblock 
is encoded by taking a difference between the current motion 
vector and the motion vector of the previous macroblock. 
The difference, called delta, is then encoded using Huffman 
encoding. This type of coding minimizes the bits required to 
represent motion vectors. 

In order to compute motion vectors, the motion vector of 
the previous macroblock is added to the delta which is 
extracted from the compressed picture bitstream. 
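
A minimal sketch of this differential scheme follows. It is deliberately simplified: one vector per macroblock, a predictor that starts at zero, and no range wrapping or Huffman coding. These are assumptions made for illustration, and the function names are hypothetical.

```python
def encode_deltas(vectors):
    """Encoder side: replace each vector by its difference from the previous one."""
    deltas, prev = [], (0, 0)
    for vx, vy in vectors:
        deltas.append((vx - prev[0], vy - prev[1]))
        prev = (vx, vy)
    return deltas


def decode_vectors(deltas):
    """Decoder side: add each delta to the previous macroblock's vector."""
    vectors, prev = [], (0, 0)
    for dx, dy in deltas:
        prev = (prev[0] + dx, prev[1] + dy)
        vectors.append(prev)
    return vectors


vectors = [(4, -2), (5, -2), (5, -1), (7, 0)]
deltas = encode_deltas(vectors)
print(deltas)  # [(4, -2), (1, 0), (0, 1), (2, 1)]
assert decode_vectors(deltas) == vectors
```

When neighbouring macroblocks move in a similar way, the deltas stay small, which is what allows them to be coded with few bits.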

Flow of Motion Vector Computation 

FIG. 15 shows the steps of extraction of parameters 
needed for the computation of motion vectors and the 
computation process. Details of the MPEG2 terms used are 
set forth in ISO standards 11172 and 13818 which are 
incorporated herein by reference. 

A macroblock may have only a forward motion vector or 
only a backward motion vector or motion vectors in both the 
directions. For each direction there may be at most 2 motion 
vectors referring to 2 different macroblocks in the reference 
picture. Therefore, there may be up to 4 motion vectors 
encoded in the bitstream per macroblock. FIG. 15 shows the 
logical flow of the extraction of the parameters needed for 
the calculation of motion vector, and the computation. 
The parameters to be extracted, that appear serially in the 
following order, are horizontal motion code, horizontal 
motion residual, horizontal dmv for dual prime vector, 
vertical motion code, vertical motion residual, and vertical 
dmv for dual prime vector. 

The computation of the motion vectors comprises: 

1) the delta computation phase, 

2) the raw vector computation phase, 

3) dual prime arithmetic for dual prime of vector, and 

4) the reference vector computation. 

These phases are executed for the vertical as well as the 
horizontal components of the motion vectors. The computation 
block computes the luma and the chroma vectors for 
the macroblock. 

The delta computation phase 1510 employs the motion 
code and the motion residual to compute the delta or differential. 
The raw vector compute 1512 includes the addition of the 
delta with the predicted motion vector which is derived from 
the motion vector of the previous macroblock. 
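
These two phases can be sketched for a single vector component as follows. The formulas follow the motion vector decoding process of ISO/IEC 13818-2 as understood here. The function names are hypothetical, f_code is assumed to be already available from the picture header, and frame/field scaling of the predictor and dual prime arithmetic are omitted. Ordinary signed integers stand in for the hardware number formats.

```python
def compute_delta(motion_code, motion_residual, f_code):
    """Delta computation phase: combine motion code and motion residual."""
    f = 1 << (f_code - 1)          # scale factor derived from f_code
    if f == 1 or motion_code == 0:
        return motion_code         # no residual is coded in these cases
    magnitude = (abs(motion_code) - 1) * f + motion_residual + 1
    return -magnitude if motion_code < 0 else magnitude


def compute_raw_vector(pmv, delta, f_code):
    """Raw vector phase: add delta to the prediction and wrap into range."""
    f = 1 << (f_code - 1)
    low, high, rng = -16 * f, 16 * f - 1, 32 * f
    vector = pmv + delta
    if vector < low:
        vector += rng
    elif vector > high:
        vector -= rng
    return vector


delta = compute_delta(motion_code=3, motion_residual=1, f_code=2)
print(delta)                                     # 6
print(compute_raw_vector(30, delta, f_code=2))   # -28, since 36 exceeds 31 and wraps by 64
```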

In case dual prime vectors are present, the dual prime 
arithmetic logic needs to be applied to the raw vector. This 
is done by the dual prime arithmetic phase 1514. 

The reference vector compute 1516 is responsible for the 
computation of motion vector with respect to the top left 
corner of the picture. This involves vector addition of the 
current macroblock coordinates with the raw vector. 
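
The reference vector computation for one component can be sketched as below. Several assumptions are made that are not stated at this point in the text: vectors are in half-pel units, macroblocks are 16 luma pixels wide, chroma is subsampled by 2 in each direction (4:2:0), and the chroma divide by 2 truncates toward zero as the MPEG2 "/" operator does. All names are hypothetical.

```python
def div2_toward_zero(v):
    """Divide by 2, truncating toward zero (Python's // would round down)."""
    return int(v / 2)


def reference_vectors(mb_index, raw):
    """Return (luma, chroma) components relative to the picture's top left corner.

    mb_index is mb_col for the horizontal pass or mb_row for the vertical pass;
    raw is the matching component of the raw vector in half-pel units.
    """
    luma_origin = mb_index * 16 * 2      # macroblock corner in luma half-pels
    chroma_origin = mb_index * 8 * 2     # same corner in chroma half-pels
    luma = luma_origin + raw
    chroma = chroma_origin + div2_toward_zero(raw)
    return luma, chroma


print(reference_vectors(3, -7))  # (89, 45)
```

Because the same function serves both passes, the sketch mirrors the idea of computing the horizontal and the vertical component with one shared piece of logic.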

The ship mv logic 1518 checks whether the external logic 
is ready to accept the motion vector and ships the motion 
vector to the interface. 

Pipelined Implementation of the Motion Vector Decode 
Process 

A key aspect of the present invention is that the performance 
of the motion vector decode process is improved by 
overlapping the motion vector computation with the extraction 
phase of the next motion vector of the macroblock. 
Without overlapping the extraction phase and the computation 
phase, the motion vector decode would require 48 cycles in the 
worst case. The pipeline chart of FIG. 16 shows the computation 
active at each cycle and the overlap is highlighted. With the 
extraction and the computation phase overlapped, the cycle count 
for the entire motion vector decode for a macroblock is 22 cycles 
or less. FIG. 16 illustrates the overlap of the extraction phase 
and the compute phase in a dotted rectangle. 

The dotted double arrow line traces the computation of 
every motion vector in the pipeline. In the worst case, there 
are 4 motion vectors for which extraction and decode need 
to be done, numbered as motion vectors 1 through 4. 

The overlap in the computation of the horizontal raw 
vector (raw_hor_vec) and the vertical delta (vert_delta) as 
well as the overlap of the vertical raw vector (raw_vert_vec) 
and the horizontal reference vector (horz_ref) are 
shown in FIG. 16. 

The scheme also handles the non-worst case scenarios 
cleanly. Wait states (stalls) are automatically introduced if 
the computation pipeline is not done and the extraction 
pipeline is ready to hand over the extracted information. 

The motion vector computation also includes the Dual 
Prime motion vector computation logic. The total number of 
cycles for the dual prime computation is 21 cycles in the 
worst case (dual prime in frame pictures). This includes the 
cycles for shipping all the dual prime vectors. 
Logic Sharing 

For improving the speed of the motion vector decoding, 
the computation of the vertical and the horizontal components 
of the motion vector is achieved in back to back cycles. 
This enables sharing of logic between the horizontal component 
and vertical component calculation while minimizing 
the cycle count. 

The compute phase is divided into the following cycles. 

Cycle 1: Computes delta for the horizontal component. 

Cycle 2: Computes delta for the vertical component. 
Magnitude clipping and the raw motion vector compute 
for the horizontal component are also done in this 
cycle. 

Cycle 3: Magnitude clipping and the raw motion vector 
compute for the vertical component. In this cycle, the 
horizontal component of the luma and the chroma 
vector with respect to the top left corner of the picture 
are also computed. 

Cycle 4: The vertical component of the luma and the 
chroma vector with respect to the top left corner of the 
picture is computed. 

Cycle 5: The luma vector is shipped. 

Cycle 6: The chroma vector is shipped. 
Advantages of Logic Sharing and Pipelining 

1) The scheme of the present invention does not replicate 
logic for the horizontal and the vertical component. 

2) The scheme of the present invention has the least 
register dependency and thus has minimal context 
passing. 

3) The state machine controls for the extract engine and the 
compute engine exchange a very simple handshake. The 
handshakes are "data ready" from the extract engine 
and "ready to accept data" from the compute state 
machine. This simple handshake reduces control logic. 

The following sections discuss the micro-architecture of 
the motion vector compute pipeline. 
Delta Compute Engine 

FIG. 17 is a functional block diagram of the delta compute 
engine of the micro-architecture of the motion vector compute 
pipeline. The motion vectors are differentially encoded 
in the bitstream. The delta compute block is primarily 
responsible for computing the differential or delta which, 
when added to the predicted motion vector (derived from the 
motion vector of the previous macroblock), will give the 
complete raw motion vector. This motion vector will be 
offset with respect to the current macroblock. 

The delta compute block selects between the horizontal 
and the vertical motion code and motion residual based on 
whether the pipeline is in the horizontal or the vertical phase 
of the computation. The selected motion code is then scaled, 
at a multiplier 1710, with a fscale value and then added to 
the selected motion residual by a magnitude adder 1712. The 
fscale is the scaling factor employed to scale up the 
motion code. During encoding, the motion code is scaled 
down to minimize the bit length of the motion code symbol. 
The output of the magnitude adder 1712 is a sign magnitude 
value of the delta. This is then converted at block 
1714 to a 1-complement number and saved in a register 
1716. 1-complement implies the bit-wise logical inversion 
of a value. 

This block also handles the conversion of the PMV to the 
predicted motion vector, in case conversion from frame format 
to field format is required. The motion code, motion residual 
and the fcodes are registered inputs to this module. A registered 
signal is one which is an output of a flip-flop. 
Range Clipping and Raw Vector Compute Engine 

FIG. 18 is a functional block diagram of the range clipping 
and raw vector compute engine of the micro-architecture of 
the motion vector compute pipeline. This block performs, at 
adder 1802, the addition of the delta in 1-complement 
(which was calculated in the delta compute block) and the 
predicted motion vector derived from the pmv registers. The 
results of the addition are then compared with the low_reg 
and the high_reg at comparator 1804. If the value of the 
result of addition exceeds the high_reg, then the range_reg 
is subtracted from the result. If the value of the result lies 
below the low_reg, then the range_reg is added to the 
result. The output of add/subtract block 1806 is then scaled 
appropriately at block 1808 depending on the picture structure 
and the motion vector format and stored in the PMV. 
The prediction vector, delta and the fscale are registered 
inputs to this module. 
Motion Vector WRT Top-left Corner of the Picture 

FIG. 19 is a functional block diagram of the motion vector 
wrt (with respect to) top-left corner of the picture block of 
the micro-architecture of the motion vector compute pipeline. 
This block performs the addition of the macroblock 
coordinates to the raw_vector that is computed by the range 
clipping and raw vector compute block. Depending on 
whether the computation is for horizontal component or 
vertical component, mb_row or mb_col is selected and the 
components of the raw_vector and dual raw vector are 
selected. The macroblock coordinates are added at adder 
1910 to the raw vector. This computation gives the luma 
motion vector wrt the top-left corner of the picture. The 
chroma motion vector is computed by performing a divide 
by 2 operation (/2, the exact meaning of this operation is set 
forth in the MPEG2 specification) on the raw vector at block 
1920 and the appropriate macroblock coordinates at block 
1922 and adding the two results at adder 1924. The raw 
vector, dual_raw_vector, mb_row and mb_col are registered 
inputs of this module. 

The foregoing scheme results in efficient clock cycle 
utilization and is critical for meeting a macroblock decode 
time of less than 8.5 microseconds. The scheme embodies a 
completely hardware based approach for the motion vector 
decoding which requires no intervention by the CPU or 
microcontroller. 

Furthermore, logic is minimized via pipelining. More 
specifically, the computation has a bare minimum of combinational 
logic and there are no extra registers required to 
carry context of the motion vectors. The only registers 
required are the ones to hold the VLD decoded motion 
parameters. The scheme ensures that the registers are not 
overwritten before they are used. 
Low Delay Mode Operation of Mbcore 

In order to implement an MPEG video decoder that can 
support video conferencing, it is imperative that a low delay 
mode operation is fully supported. 

In the following sections, an implementation of a low 
delay mode operation in mbcore with minimal hardware is 
disclosed. The implementation achieves the low delay mode 
operation by stalling the mbcore pipeline flexibly when the 
data level in the prefetch buffer goes below a certain 
threshold, e.g., a predetermined threshold. The implementation 
introduces flexibility and fine grain control in the 
video decoding architecture. Furthermore, it enables the 
applicability of the mbcore to video conferencing applications 
and is achieved through a low-cost hardware implementation. 
Low Delay Mode 

In video conferencing applications using MPEG2, both 
the rate of data transfer and the quantity of the data transferred 
are significantly small. Under this scenario, it is very 
likely that a prefetch buffer, which holds the compressed 
picture data to be decoded, would run out of data while 
mbcore is still processing a macroblock. This situation 
mandates that mbcore be gracefully stalled so that it does not 
decode any invalid picture data. At the same time, it is 
necessary for mbcore to restart decoding, losing minimal 
time, when valid compressed bitstream becomes available. 
The mbcore implementation described below supports this 
scheme and uses minimal hardware to implement it. 
Flow of Operations in Mbcore 

According to a preferred embodiment of the present 
invention, the operation of mbcore is functionally divided 
into four stages: 

1. slice header processing 

2. macroblock header processing 

3. motion vector extraction and computation 

4. discrete cosine transform coefficient extraction. 

The data for each of these stages appear serially in the 
video bitstream which is stored in the prefetch buffer. FIG. 
20 illustrates the top level division of the bitstream processing 
flow in mbcore. 

There is data steering logic in mbcore that steers data from 
a prefetch buffer 2010 to the various functional blocks 2012, 
2014, 2016, 2018 of mbcore as required by the blocks. The 
bytes_low signal, when asserted, indicates that the prefetch logic 
is about to run out of data. On receiving the bytes_low 
signal, the functional blocks stop requesting data from the 
data steering logic which in turn deasserts the read request to 
the prefetch buffer 2010. The operation resumes when the 
data becomes available with no cycle hit. 
Detailed Operation of Low Delay Mode in Mbcore 

The mbcore is designed to check status of the prefetch 
buffer 2010 at the following points of operation: 

a) At the start of a slice; 

b) At the beginning of dct decoding; and 

c) At the beginning of every coded block. 

If all blocks in a macroblock are coded, then there are 6 
such checks to be performed. 

The check in case a) ensures that there are 264 bits or 33 
bytes of data in the prefetch buffer 2010 before starting 
decoding. This data requirement is the worst case requirement 
to finish decoding up to and including all the motion vectors. 
The checks in case b) and c) ensure that there is enough 
data in the prefetch buffer 2010 to complete decoding of a 
block of dct coefficients. This requires the prefetch buffer 
2010 to contain 224 bytes of data before the decode operation 
begins. 

The byte_low signal is checked combinationally when 
the slice processing state machine is in the state of processing 
the slice header. The state transition in slice state 
machine is gated by the byte_low signal. In the dct block, 
the byte_low signal is checked when the dct state machine 
is processing the "end of block" symbol. At this point, the 
bytes_low signal is combinationally checked and state 
transition to the state of processing of the coefficients of the 
next block is gated by the bytes_low signal. 

In this approach, no symbol is split between the prefetch 
buffer and mbcore. The stalls occur at clean symbol 
boundaries of the decode when the bytes_low is 
active. 

The disclosed system for low delay mode is a completely 
hardware based approach. It requires no intervention by the 
CPU or microcontroller. The byte_low signal is generated 
in hardware and used in hardware without CPU intervention. 
The checks are implemented using simple gating logic at 
minimal check points. Thus, the disclosed system allows for 
completely asynchronous operation of the prefetch buffer 
and mbcore. 

The disclosed system is essential for video conferencing 
using MPEG2. Once the prefetch buffer threshold is 
reached, the decoding is extremely fast. There is no extra 
wait state inserted to restart bitstream processing and the 
processing consumes a minimal number of cycles (less than 
8.5 microseconds). This is possible because the symbols are 
not split. 

Furthermore, the system advantageously allows the size 
of the prefetch buffer to be minimized. With this scheme, the 
prefetch buffer needs to be no larger than 224 bytes. This 
scheme allows for the prefetch buffer to be filled in parallel 
to consumption of the video data from the prefetch buffer. If 
such a scheme did not exist, either of the following two 
schemes would be required: (a) the prefetch buffer would 
have to be large enough to hold the entire picture; or (b) the 
decoder would have to request symbols from the memory 
system directly every time it needs symbols. Scheme (a) is 
prohibitive because of the memory requirement of the 
decoder. Scheme (b) is prohibitive because of the high 
memory bandwidth requirements of the decoder and also 
because such memory requests may cause symbol splitting 
between memory requests which will disallow single cycle 
VLD symbol decode. 

While the invention has been described in connection 
with specific embodiments thereof, it will be understood that 
the invention is capable of further modifications. This application 
is intended to cover any variations, uses or adaptations 
of the invention following, in general, the principles of 
the invention, and including such departures from the 
present disclosure as come within known and customary 
practice within the art to which the invention pertains. 
What is claimed is: 

1. A system for motion vector extraction and computation 
comprising: 

an architecture adapted to overlap an image extraction 
process with a computation process to provide 2-frame 
store decode with letter box scaling capability and to 
compute one or more motion vectors, wherein said 
overlap allows (i) said image extraction process to 
operate on a first block of data and (ii) said computation 
process to simultaneously operate on a second block of 
data. 

2. The system for motion vector extraction and compu- 
tation of claim 1 wherein: 

said architecture is adapted to extract a plurality of 
parameters usable for calculating a motion vector. 

3. The system for motion vector extraction and compu- 
tation of claim 2 wherein: 

said plurality of parameters comprise horizontal motion 
code, horizontal motion residual, horizontal dmv for 
dual prime vector, vertical motion code, vertical motion 
residual, and vertical dmv for dual prime vector param- 
eters. 

4. The system for motion vector extraction and computation 
of claim 1 wherein: 

said architecture is adapted to compute motion vectors 
according to a multi-phase motion vector computation 
process. 

5. The system for motion vector extraction and computation 
of claim 4 wherein: 

said multi-phase motion vector computation process com- 
prises: 

a delta computation phase; 
a raw vector computation phase; 
a dual prime arithmetic for dual prime of vector phase; 
and 

a reference vector computation phase. 

6. The system for motion vector extraction and compu- 
tation of claim 1 wherein: 

said architecture is adapted to compute vertical and hori- 
zontal components of motion vectors in back to back 
cycles. 

7. The system for motion vector extraction and computation 
of claim 1 wherein: 

said architecture comprises a motion vector compute 
pipeline. 

8. The system for motion vector extraction and compu- 
tation of claim 7 wherein: 

said motion vector compute pipeline comprises logic 
which is adapted to operate without inputs from a CPU 
or microcontroller. 

9. The system for motion vector extraction and compu- 
tation of claim 7 wherein: 

said motion vector compute pipeline comprises a delta 
compute engine adapted to generate a delta from a 
motion code and a motion residual. 

10. The system for motion vector extraction and compu- 
tation of claim 9 wherein: 

said delta compute engine is adapted to generate a predicted 
motion vector in consideration of a motion 
vector of a previous macroblock. 

11. The system for motion vector extraction and compu- 
tation of claim 10 wherein: 

said motion vector compute pipeline comprises a raw 
vector compute engine adapted to generate a raw vector 
from said delta and said predicted motion vector. 
12. The system for motion vector extraction and computation 
of claim 11 wherein: 

said motion vector compute pipeline comprises a motion 
vector with respect to top left corner of picture block 
adapted to generate luma and chroma motion vectors 
from said raw vector and macroblock coordinate infor- 
mation. 

13. The system for motion vector extraction and compu- 
tation of claim 12 wherein: 

said motion vector block is adapted to generate said 
chroma motion vector according to a divide by 2 
operation. 

14. A system for motion vector extraction and computa- 
tion comprising: 

an architecture adapted to overlap an image extraction 
process with a computation process and to provide 
2-frame store decode with letter box scaling capability, 
to extract a plurality of parameters usable for calculat- 
ing a motion vector, and to compute motion vectors; 

said architecture being adapted to compute vertical and 
horizontal components of motion vectors in back to 
back cycles; and 

said architecture comprising a motion vector compute 
pipeline, wherein said overlap allows (i) said image 
extraction process to operate on a first block of data and 
(ii) said computation process to simultaneously operate 
on a second block of data. 

15. The system for motion vector extraction and compu- 
tation of claim 14 wherein: 

said motion vector compute pipeline comprises a delta 
compute engine adapted to generate a delta from a 
motion code and a motion residual and to generate a 
predicted motion vector in consideration of a motion 
vector of a previous macroblock. 

16. The system for motion vector extraction and compu- 
tation of claim 14 wherein: 

said motion vector compute pipeline comprises a raw 
vector compute engine adapted to generate a raw vector 
from a delta and a predicted motion vector. 

17. The system for motion vector extraction and compu- 
tation of claim 14 wherein: 

said motion vector compute pipeline comprises a motion 
vector with respect to top left corner of picture block 
adapted to generate luma and chroma motion vectors 
from a raw vector and macroblock coordinate informa- 
tion. 

18. A method for motion vector extraction and computa- 
tion comprising: 

providing an architecture adapted to overlap an image 
data extraction process with a computation process to 
provide 2-frame store decode with letter box scaling 
capability, to extract a plurality of parameters usable for 
calculating a motion vector, and to compute vertical 
and horizontal components of motion vectors in back to 
back cycles, wherein said overlap allows (i) said image 
extraction process to operate on a first block of data and 
(ii) said computation process to simultaneously operate 
on a second block of data. 

19. The method for motion vector extraction and compu- 
tation of claim 18 wherein: 

said architecture comprises a delta compute engine 
adapted to generate a delta from a motion code and a 
motion residual and to generate a predicted motion 
vector in consideration of a motion vector of a previous 
macroblock. 
20. The method for motion vector extraction and computation 
of claim 18 wherein: 

said architecture comprises a raw vector compute engine 
adapted to generate a raw vector from a delta and a 
predicted motion vector. 

21. The method for motion vector extraction and compu- 
tation of claim 18 wherein: 

said architecture comprises a motion vector with respect 
to top left corner of picture block adapted to generate 
luma and chroma motion vectors from a raw vector and 
macroblock coordinate information. 

22. A system for motion vector extraction and computa- 
tion comprising: 

an architecture adapted to overlap an image extraction 
process with a computation process and to provide 
2-frame store decode with letter box scaling capability, 
to extract a plurality of parameters usable for calculating 
a motion vector, and to compute motion vectors, 
said architecture comprising a motion vector compute 
pipeline, wherein said overlap allows (i) said image 
extraction process to operate on a first block of data and 
(ii) said computation process to simultaneously operate 
on a second block of data; 

said architecture being adapted to compute vertical and 
horizontal components of motion vectors in back to 
back cycles; and 

said motion vector compute pipeline comprises logic 
which is adapted to operate without inputs from a CPU 
or a microcontroller. 

* * * * * 